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Ab initio RNA secondary structure predictions have long 
dismissed helices interior to loops, so-called pseudoknots, 
despite their structural importance. Here, we report that 
many pseudoknots can be predicted through long time scales 
RNA folding simulations, which follow the stochastic closing 
and opening of individual RNA helices. The numerical effi- 
cacy of these stochastic simulations relies on an 0(n 2 ) clus- 
tering algorithm which computes time averages over a con- 
tinously updated set of n reference structures. Applying this 
exact stochastic clustering approach, we typically obtain a 
5- to 100-fold simulation speed-up for RNA sequences up to 
400 bases, while the effective acceleration can be as high as 
10 5 -fold for short multistable molecules (< 150 bases). We 
performed extensive folding statistics on random and nat- 
ural RNA sequences, and found that pseudoknots are un- 
evenly distributed amongst RNA structures and account for 
up to 30% of base pairs in G+C rich RNA sequences (On- 
line RNA folding kinetics server including pseudoknots : 
http://kinefold.u-strasbg.fr/!. 

The folding of RNA transcripts is driven by intramolecular 
GC/AU/GU base pair stacking interactions. This primarily leads to the 
formation of short double-stranded RNA helices connected by unpaired 
regions. Ab initio RNA folding prediction restricted to tree-like sec- 
ondary structures is now well establishedQ|,0,|3,0,SS0Sl ar, d has 
become an important tool to study and design RNA structures which 
remain by and large refractory to many crystallization techniques. Yet, 
the accuracy of these predictions is difficult to assess -despite the pre- 
cision of stacking interaction tables|7]- due to their a priori dismissal 
of pseudoknot helices, Fig 1A. 

Pseudoknots are regular double-stranded helices which provide spe- 
cific structural rigidity to the RNA molecule by connecting different 
"branches" of its otherwise more flexible tree-like secondary structure 
(Figs 1A-B). Many ribozymes, which require a well-defined 3D en- 
zymatic shape, have pseudoknots^ El E3. 13 E3. d EE E3l 
Pseudoknots are also involved in mRNA-ribosome interactions during 
translation initiation and frameshift regulationl 18]. Still, the overall 
prevalence of pseudoknots has proved difficult to ascertain from the 
limited number of RNA structures known to date. This has recently 
motivated several attempts to include pseudoknots in RNA secondary 
structure predictions! 19, 20, 21]. 

There are two main obstacles to include pseudoknots in RNA struc- 
tures: a structural modeling problem and a computational efficiency 
issue. In the absence of data bases for pseudoknot energy parameters, 
their structural features have been modeled at various descriptive levels 
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FIG. 1: A An RNA secondary structure with pseudoknots. B Minimum 
set of helices defined as "pseudoknots" and visualized for convenience 
by colored single-stranded regions connected by two straight lines. C 
The entropic cost of the actual 3D structural constraints is evaluated by 
modeling RNA helices as stiff rods (black) and single-stranded regions 
as ideal polymer springs. Colored single-stranded circuits define quasi- 
independent structural domains referred to as "nets" in ref 12 ill . 



using polymer theorvfl9|. |2l|. |22ll . From a computational perspective, 
pseudoknots have proved not easily amenable to classical polynomial 
minimization algorithms 1 20] due to their intrinsic non-nested nature. 
Instead, simulating RNA folding dynamics has provided an alterna- 
tive avenue to predict pseudoknots 1 21 , 22] in addition to bringing some 
unique insight into the kinetic aspects of RNA foldingl 8 H2lll . 

Yet, stochastic RNA folding simulations can become relatively in- 
efficient due to the occurrence of short cycles amongst closely related 
configurations |22|], which typically differ by a few helices only. Not 
surprisingly, similar numerical pitfalls have been recurrent in stochas- 
tic simulations of other trapped dynamical systems|23, 24, 25, 26 H2tTi . 

To address this computational efficiency issue and capture the slow 
folding dynamics of RNA molecules, we have developed a generic al- 
gorithm which greatly accelerates RNA folding stochastic simulations 
by exactly clustering the main short cycles along the explored fold- 
ing paths. The general approach, which may prove useful to simulate 
other trapped dynamical systems, is discussed in the main subsection 
of Theory and Methods. In the Results section, the efficacy of these 
exactly clustered stochastic (ECS) simulations is first compared to non- 
clustered RNA folding simulations, before being used to predict the 
prevalence of pseudoknots in RNA structures on the basis of the struc- 
tural model introduced in ref 1 2 1 ] and briefly reviewed hereafter. 

Theory and Methods 

Modeling and visualizing pseudoknots in RNA structures. We 

model the 3D constraints associated with pseudoknots using polymer 
theory. The entropy costs of pseudoknots and internal, bulge and 
hairpin loops are evaluated on the same basis by modeling the sec- 
ondary structure (including pseudoknots) as an assembly of stiff rods - 
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representing the helices- connected by polymer springs -corresponding 
to the unpaired regions, Fig 1C. In practice, free energy computations 
involve the labelling of RNA structures into constitutive "nets" -shown 
as colored circuits on Fig 1C- to account for the stretching of the un- 
paired regions linking the extremities of pseudoknot helices, see ref 1 2 1 ] 
for details. In addition, free energy contributions from base pair stack- 
ings, terminal mismatches and co-axial stackings are taken from the 
thermodynamic tables measured by the Turner lab|7]. 

The main limitation of this structural model is the absence of hard- 
core interactions, which could stereochemically prohibit certain RNA 
structures with either long pseudoknots (e.g. , > 1 lbp, one helix turn) or 
a large proportion of pseudoknots (e.g., >30% of formed base pairs). 
However, we found that such stereochemically improbable structures 
account for less than l-to-10% of all predicted structures, depending 
on G+C content (see Results section). Hence, in practice, neglecting 
hardcore interactions is rarely a stringent limitation, except for a few, 
somewhat pathological cases. 

Although the presence of pseudoknots in an RNA structure is not 
associated to a unique set of helices, it is convenient for visualization 
and statistics purposes to define the set of pseudoknots as the minimum 
set of helices which should be imagined broken to obtain a tree-like 
secondary structure, Fig IB. Finding such a minimum set (with respect 
to the number of base pairs or their free energy) amounts to finding the 
maximum tree-like set amongst the formed helices and can be done in 
polynomial time using a classical "dynamic programming" algorithm. 

Modeling RNA folding dynamics and straightforward stochastic 
algorithm. RNA folding kinetics is known to proceed through rare 
stochastic openings and closings of individual RNA helices 1 28]. The 
time limiting step to transit between two structures sharing essen- 
tially all but one helix can be assigned Arrhenius-like rates, k± = 
k° x exp(— AG±/kT), where kT is the thermal energy. k°, which 
reflects only local stacking processes within a transient nucleation core, 
has been estimated from experiments on isolated stem-loops| 28](fc° ~ 
10 8 s" 1 ), while the free energy differences AG± between the transition 
states and the current configurations (Fig 2) can be evaluated by com- 
bining the stacking energy contributions and the global coarse-grained 
structural model described above, Fig 1C. 

Simulating a stochastic RNA folding pathway amounts to follow- 
ing one particular stochastic trajectory within the large combinatorial 
space of mutually compatible helices|22]. Each transition in this dis- 
crete space of RNA structures corresponds to the opening or closing 
of a single helix, possibly followed by additional helix elongation and 
shrinkage rearrangements to reach the new structure's equilibrium com- 
patible with a minimum size constraint for each formed helix| 21] (base 
pair zipping/unzipping kinetics occurs on much shorter time scales than 
helix nucleation/dissociation). For a given RNA sequence, the total 
number of possible helices (which roughly scales as L , where L is 
the sequence length) sets the local connectivity of the discrete struc- 
ture space and therefore the number of possible transitions from each 
particular structure. 

Formally, we consider the following generic model. Each structure 
or "state" i is connected to a finite, yet possibly state-to-state vary- 
ing number of neighboring configurations j via transition rates kji (the 
right-to-left matrix ordering of indices is adopted hereafter). As kji is 
the average number of transitions from state i to state j per unit time, the 
lifetime ti of configuration i corresponds to the average time before any 
transition towards a neighboring state j occurs, i.e., ti = l/Xwjv^j*- 
and the transition probability from state i to state j is pji — kjiti, with 
= 1, as expected, for all state i. 

Hence, in the straightforward stochastic algorithm lElt l22ll . each new 
transition is picked at random with probability pji while the effective 
time is incremented with the lifetime i; of the current configuration 




FIG. 2: Stochastic transitions over a thermodynamic barrier AG± to 
close and open an individual helix between two neighbor RNA struc- 
tures, i and j. Nucleation of the new helix usually involve some local 
unzipping of nearby helices at the barrier and further base pair rear- 
rangements to reach equilibrium in the new structure j [ 2 1 ] . 



i|30]. However, as mentioned in the introduction, the efficiency of this 
approach is often severely impeded by the existence of kinetic traps 
consisting of rapidly exchanging states. 

Exactly clustered stochastic (ECS) simulations. As in the case 
of RNA folding dynamics, the simulation of other trapped dynami- 
cal systems generally presents a computational efficiency issue. In 
particular, powerful numerical schemes have been developed to com- 
pute the elementary escape times from traps for a variety of simula- 
tion techniques!^ 12a. l26t l27ll . Still a pervasive problem usually 
remains for most applications due to the occurrence of short cycles 
amongst trapped states, and heuristic clustering approaches have been 
proposed to overcome these "numerical traps"| 29]. 

To capture the slow folding dynamics of RNA molecules, we have 
developed an exact stochastic algorithm which accelerates the simula- 
tion by numerically integrating the main short cycles amongst trapped 
states. This approach being quite general, it could prove useful to sim- 
ulate other small, trapped dynamical systems with coarse-grained de- 
grees of freedom. 

In a nutshell, the ECS algorithm aims at overcoming the numerical 
pitfalls of kinetic traps by "clustering" some recently explored configu- 
rations into a single, yet continuously updated cluster A of n reference 
states. These clustered configurations are then collectively revisited in 
the subsequent stochastic exploration of states. Although stochasticity 
is "lost" for the individual clustered states, its statistical properties are, 
however, exactly transposed at the scale of the set A of the n reference 
states. This is achieved as follows. For each pathway on A, a sta- 
tistical weight W Cm = Y[° m P lk ls defined, where k and / run over all 
consecutive states along from its "starting" state i to its "exiting" 
state j on A. The nxn probability matrix P A which sums the statisti- 
cal weights W Cm over all pathways on A between any two states 
i and j of A is then introduced, 

^=£^=£(]>A a) 

m:j* — i m,:j< — i j* — i 

and the exit probability to make a transition outside A from the state 
j is noted: p^ A = 1 — ^^Pkj- Hence, starting from state i, the 

probability to exit the set A at state j is p e j A P A , with J2f Pj Ap ji = 1- 
for all i of A. 
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Thus, in the ECS algorithm, one first chooses at random with proba- 
bility pj A p£ the reference state j of A from which a new transition 



towards a state k outside A will then be chosen stochastically with 
probability pkj/Pj A . Meanwhile, the physical quantities of interest, 
like the cumulative time lapse tf to exit the set A from j starting at i, 
are exactly averaged over all (future) pathways from i to j within A, as 
explained in the next subsection. Finally, the new state k is added to 
the reference set A whilst another reference state is removed, so as to 
update A, as discussed in The 0(n 2 ) algorithm subsection. 



statistical weight (J} S pikjpba (Y[ A Pl'k') ■ Its contribution to the av- 
erage time to exit somewhere from the union of A and B is, 



B A \ B A 

5> + J> \\ Plk Pba Y[ P^'k' = ( 4 ) 
j' — b a' — i j* — b a< — i 
B B v A B , A A 

JJpifclpfm ^\_PVk'+Y\_Plk Pba ( y^fe'T~|"pi'fc 
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Exact averaging over all future pathways. We start the discussion 
with the path average time lapse to exit the set A. Let us introduce 
the time lapse transform of P A : T[P A ]{t} = P A {t}, which sums the 

weighted cumulative lifetimes ( m th) Y\ m Plk over all pathways 
C„ on A between any two states i and j of A, 



E 



5> n 



pik 



(2) 



where the th's are summed over all consecutive states h -from i to 
j included- along each pathway C^. Hence, the mean time if to 
exit A from any state j of A starting from configuration i is, if = 



or in matrix form for any "direct" pathway from A to B, 

T [p B T BA p A ] = T j p flj T BA p A + pBrpBA^pA^ (5) 

which implies that applying the usual differentiation rules to any com- 
bination of probability matrices yields the correct combined path av- 
erage matrices (defining T[T BA ]ij = for all i and j). Note, this 
out-of-equilibrium calculation of path average quantities is reminiscent 
of the usual equilibrium calculation of thermal averages through differ- 
entiation of an appropriate Partition Function. Indeed, the probability 
matrices introduced here are "partition functions" over all pathways 
within a set of reference states. 

The 0(n 2 ) algorithm. With this result in mind, we can now return 
to the calculation of the probability and path average matrices P c and 



E j P e j p ji it}- However, in the context of the ECS algorithm, the pc for the union q of tWQ disjoint sets A and B 



time lapse of interest is tf it the mean time to exit A from a particular 
state j, i A = pfP A {t}/pfP A = P A {t}/P A . 

The average of any path cumulative quantity of interest Xi can be 
similarly obtained by introducing the appropriate P A {x} matrix. In 
particular, the instantaneous efficiency of the algorithm is well reflected 
by the average pathway length if between any two states of A, 



ji Pji {^}/ Pji ! 



(3) 



where P A {£} = KE° m l) IT" P«*J - with 1 cor- 

responding to the length of the pathway C A (1 is added at each state 
along each pathway C^). Hence, starting from state i, corresponds 
to the average number of transitions that would have to be performed 
by the straightforward algorithm before exiting the set A at state j. As 
expected, iff can be very large for a trapped dynamical system, which 
accounts for the efficiency of the present algorithm. Since the approach 
is exact, there is, however, no a priori requirement on the trapping con- 
dition of the states of A and the algorithm can be used continuously. 

Similarly, the time average of any physical quantity yi -like the 
pseudoknot proportion of an RNA molecule- can be calculated by 
introducing the appropriate time weighted matrix P A {yt}. For in- 
stance, the time average energy over all pathways between any two 
states i and j of A is, E A = P A {Et}/ P A {t}, where P A {Et} = 

J2 c m A ^[(J2 c - E h t h ) Yf- Pl k]. 

The actual calculation of the probability and path average matrices 
P c and P c over a set C of N states will be performed recursively in 
the next subsection. As an intermediate step, we first consider hereafter 
the unidirectional connection between two disjoint sets A and B. 

Let us hence introduce the transfer matrix T BA from set A to set B 
defined as T BA — pji, where pji is the probability to make a transition 
from state i of A to state j of B (T BA — if i and j are not connected). 
We will assume that A has n states and B m states and that their prob- 
ability and path average matrices P A , P A , P B and P B are known. 
Starting at state i of A, we find that the probability to exit on j of B 
after crossing once and only once from A to B is, Pj B (P B T BA P A )ji, 
where we have used matrix notations. Let us consider a particular path 
from i in A to j in B crossing once and only once from A to B, with 



Defining P Ab = p A T AB and P Ba = p B T BA , we readily obtain 
the probability matrix P c as an infinite summation over all possible 
pathway loops between the sets A and B (I is the identity matrix), 



pc = 



} AA Q AB 
ISA n,BB I J 



with 



Q BA Q L 

QAA = \j + pAbpBa + ^pAbpBas } 2 + ^ p A = L Ap/ 



(6) 



yBA _ pBa j^A pA 



— Ij J r p Ba p Ab + (p Ba p Abs j 2 _|_. . .j p B —L B p E 



pAbjBpB 



where L A = [/ - p Ab p Ba ] ' 1 and L B = [I - P Ba P Ab ] - 1 . 

Defining also P Ab = p A T AB and P Ba = p B p BA , we finally 
obtain the path average matrix P c from simple "differentiation" of the 
"partition function" P° , Eqs.j6}, 



Q AA Q AB 
r\BA r\BB I i 



with 



(7) 



~ L ApA + L ApA 



_ pBa j^A pA , paa j^A pA , poa j^A p 



BarAnA . r,Ba r A fti 



= L B P B +L B P B 



Q AB = pAbjBpB + pAO Lapa + p ™ L »p 

where, L A = L A (p At >p B - + P Ab p B ^ L A 
and L B = L B (P Ba P Bb + P B « P Ab ) L B 

Eqs.JSJ and are valid for any sizes n and ra of A and B. Hence P c 
and P c can be calculated recursively starting from iV isolated states 
and 2N 1 x 1 matrices P l = [1] and P l {x} = [»«], with i = 1, N, 
where Xi is the value of the feature of interest in state i. Clustering 
those states 2 by 2, then 4 by 4, etc., using Eqs.jS| and Q finally 
yields P° and P c in 0(N 3 ) operations (i.e., by matrix inversions and 
multiplications). However, instead of recalculating everything back re- 
cursively from scratch each time the set of reference states is modified, 
it turns out to be much more efficient to update it continuously each 
time a single state is added. Indeed, Eqs.JSJ and Q can b e calculated 
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in 0{n 2 ) operations only, when m = 1 and n — N — 1, as we will 
show below. Naturally, a complete update also requires the removal of 
one "old" reference state each time a "new" one is added, so as to keep 
a stationary number n of reference configurations. As we will see, this 
removal step can also be calculated in 0{n 2 ) operations only. 

The 0{n 2 ) -operation update of the reference set, which we now out- 
line, relies on the fact that T AB , P Ab and P Ab are n x 1 matrices and 
that T BA , P Ba and P Ba are lxn matrices, when m = 1 and n = N-l 
(P B and L B are simple lxl matrices for a single state B). Since 
we operate on vectors, the Sherman-Morrison formula|31] can then be 
used to calculate the n x n matrix L A = [/ - P Ab ® P Ba ] 1 
/ + P Ab ® P Ba /(\ - P- 46 ■ P Ba )] . Hence, not only L A but also 

any matrix product L A M, where Misanxn matrix, can be eval- 
uated in 0(n 2 ) operations [by first calculating P Ba M followed by 
P Ab ® (P Ba M)]. Noticing that the same reasoning applies for the 
nxn matrices P Ab ® P Ba and P Ab ® P Sa provides a simple scheme 
to add a single reference state to A and obtain matrices P c and P c in 
0(n 2 ) operations using Eqs.JSJ and 0. 

In order to achieve the reverse modification consisting in remov- 
ing one state B from the reference set C, it is useful to first imag- 
ine that the original P c and P c were obtained by the addition of 
the single state B to the n-configuration set A, as given by Eqs.j6} 
and 0. Identifying row Q BA , column Q AB and their intersection 
Q BB corresponding to the single state B readily yields the vectors 

pAb = qAB iqBB ^ pBa = pBA ^ p B = r^ ^ ^ 



nxn matrix [L J 



)/Q 



This gives the following relations between the known L A , T AB , T BA , 
Q AA , Q BB , Q BA , Q AB , P B and Q AA , and the unknown P A and P A , 



Q AA 
Q AA 



L A P A , 



P A (I + T AB ( 



which eventually provides P A and P A using the Sherman-Morrison 



formula| 3 1 ] to invert I + T 



AB 



P A = [L*]- 1 ^ = [I- 

-.AB^pBA 



Q BA , 
Q AB 



<T BA \^ AA 



P A = 



I- 
I- 



T 



Q BB 

AB „ r>BA 



~)BB 

P B 
' CtBB 



Q AB ®Q BA 



1 - T AB ■ Q BA 



(8) 



(9) 



Hence, the single state B can be removed from the set of reference C 
in 0(n 2 ) operations to yield the updated probability and path average 
matrices P A and P A . 

Note, however, that this continuous updating procedure, using alter- 
natively Eqs. <6l7t and Eqs. <8!9> in succession, is expected to become 
numerically unstable after too many updates of the reference set. For 
1 < n < 300, we have usually found that the small numerical drifts [as 
measured e.g. by e = ^ A (J^ . P^Pjt ~ I) 2 — 0] can simply be re- 
set every n th update by recalculating matrices P A and P A recursively 
from n isolated states in 0(n 3 ) operations, so as to keep the overall 
C(n 2 )-operation count per update of the reference set. 

Another important issue is the choice of the state to be removed from 
the updated reference set. Although this choice is in principle arbitrary, 
the benefit of the algorithm strongly hinges on it (for instance removing 
one of the most statistically visited reference states usually ruins the 
efficiency of the method). We have found that a "good choice" is often 
the state j* with the lowest "exit frequency" from the current state i 
[i.e., l/i A k i — min^(l/t^)], but other choices may sometimes prove 
more appropriate. 
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FIG. 3: A: Expected (grey lines) and actual (black lines) speed-up of 
the approach with respect to the straightforward algorithm (see main 
text), i: Bistable molecule in Fig 4C (with a combinatorial structure 
space of 37 possible helices); ii: 67-nt-long molecule with reverse se- 
quence of the bistable molecule in Fig 4C (38 possible helices). The 
C9(n 2 ) algorithm becomes unstable above 40 reference states in this 
case (see main text); Hi: Hepatitis delta virus ribozyme, Fig 4B (84 
possible helices); iv: average speed-up for random 100-nt-long RNA 
sequences with 50% G+C content, v: Group I intron ribozyme, Fig 4A 
(894 possible helices). B: Net speed-up distribution amongst random 
100-nt-long RNA sequences with 50% G+C content (iv on Fig 3 A) for 
a cluster of 40 reference states. 



Results 

Performance of the ECS algorithm. Before applying the ECS algo- 
rithm to investigate the prevalence of pseudoknots in RNA structures, 
we first focus on the efficacy of the approach by studying the net speed- 
up of the ECS algorithm with respect to the straightforward algorithm. 
As illustrated on Fig 3 for a few natural and artificial sequences, there 
is an actual 10 1 to 10 5 -fold increase of the ratio "simulated-time over 
CPU-time" between ECS and straightforward algorithms (black lines) 
for RNA shorter than about 150 nt, Fig 3. This improvement runs par- 
allel to the expected speed-up (grey lines) as predicted by lj it Eq.j3), 
as long as the number n of reference states is not too large (typically 
n < 50 here), so that the 0{n 2 ) update routines do not significantly 
increase the operation count as compared to the straightforward algo- 
rithm. Hence, the ECS algorithm is most efficient for small trapped 
systems (when the dynamics can be appropriately coarse-grained), al- 
though a several-fold speed-up can still be expected with somewhat 
larger systems, such as the 394-nt-long Group I intron pictured in 
Fig 4A. 

Alternatively, using this exact approach may also provide a con- 
trolled scheme to obtain approximate coarse-grained dynamics for 
larger systems. The C routines of the ECS algorithm are freely available 
upon request. 

Pseudoknot prediction and prevalence in RNA structures. In the 

context of RNA folding dynamics, the present approach can be used 
to evaluate time averages for a variety of physical features of interest, 
such as the free energy along the folding paths, the fraction of time par- 
ticular helices are formed, the extension of an RNA molecule unfold- 
ing under mechanical force|32], the end-to-end distance of a nascent 
RNA molecule during transcription, etc. Here, we report results on 
the prediction of pseudoknot prevalence in RNA structures. They have 
been obtained performing several thousands of stochastic RNA fold- 
ing simulations including pseudoknots. As explained in Theory and 
Methods, the structural constraints between pseudoknot helices and un- 
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FIG. 4: RNA structure prediction with the ECS algorithm. Structures 
are drawn using the "RNAMovies" software|33] adapted to visualize 
predicted pseudoknots. A 394-base long Tetrahymena Group I intron: 
the lowest free-energy structure found shares 80% base pair identity 
with the known 3D structure, including the two main pseudoknots, P3 
and P13|lll[ll[ll[ll[ll[T3|. B 88-base long hepatitis delta virus 
ribozyme: predicted structure shares 93% base pair identity with the 
known 3D structure, including the main pseudoknot P2|21] (but not the 
2-base pair long Pl.l| 13]); C The two structures of a bistable, 67-nt- 
long artificial RNA molecule. 



paired connecting regions are modeled using elementary polymer the- 
ory (Fig 1C,|21]) and added to the traditional base pair stacking inter- 
actions and simple loops' contributions 0]. 

We found that many pseudoknots can effectively be predicted with 
such a coarse-grained kinetic approach probing seconds to minutes 
folding time scales. No optimum "final" structure is actually predicted, 
as such, in this folding kinetic approach. Instead, low free-energy struc- 
tures are repeatedly visited, as helices stochastically form and break. 
Fig 4A represents the lowest free-energy secondary structure found for 
394-nt-long Tetrahymena Group I intron, which shows 80% base pair 
identity with the known 3D structure, including the two main pseu- 
doknots, P3 and P13(H d E d 0. A number of smaller 
known structures with pseudoknots are also compared to the lowest 
free-energy structures found with similar stochastic RNA folding sim- 
ulations in| 21]. In addition, to facilitate the study of folding dynamics 
for specific RNA sequences, we have set up an online RNA folding 
server including pseudoknots at URL http://kinefold.U-Strasbg.fr/ 

Beyond specific sequence predictions, we also investigated the gen- 
eral prevalence of pseudoknots by studying the "typical" proportion of 
pseudoknots in both random RNA sequences of increasing G+C con- 
tent (Fig 5) and in 150-nt-long mRNA fragments of the Escherichia 




c 






c 


1/3 


10% 


o 


I 




ribu 


true 




t dist 


NA s 


5% 


c 






1 


■r. 






M 

cs 




u 


c 




X 
PH 


1 


0% 



10% 20% 
Pseudoknot proporti< 



30% 



40% 



FIG. 5: Distribution of pseudoknot proportion amongst formed base 
pairs for 50-nt-long (A), 100-nt-long (B), and 150-nt-long (C) random 
sequences of increasing G+C content. Projected lines correspond to 
the average pseudoknot proportion in 50 (blue), 100 (red), and 150- 
nt-long (green) random sequences. All three average curves are dis- 
played in inset on Fig 5B. Open (and filled) symbols on Fig 5C cor- 
respond to known (and predicted) pseudoknot proportions for Tetrahy- 
mena group I intron, Fig 4A (triangles) and Hepatitis delta virus ri- 
bozyme, Fig 4B H3II21I1 (circles). 



coli and Saccharomyces cerevisiae genomes. The statistical analysis 
was done as follows: for each random and genomic sequence set, 100 
to 1000 sequences were sampled and 3 independent folding trajectories 
were simulated for each of them, using the ECS algorithm. A minimum 
duration for each trajectory was determined so that more than 80-90% 
of sequences visit the same free-energy minimum structures along their 
3 independent trajectories. The time average proportion of pseudoknots 
was then evaluated, considering this fraction of sequences having likely 
reached equilibrium (including the 10-20% of still unrelaxed sequences 
does not significantly affect global statistics). In practice, slow fold- 
ing relaxation limits extensive folding statistics to sequences up to 150 
bases and 75% G+C content, although individual folding pathways can 
still be studied for molecules up to 250 to 400 bases depending on their 
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specific G+C contents. 

The results for 50-nt-long (Fig 5A), 100-nt-long (Fig 5B), and 150- 
nt-long (Fig 5C) random sequences show, first, a broad distribution in 
pseudoknot proportion from a few percents of base pairs to more than 
30% for some G+C rich random sequences. Such a range is in fact 
compatible with the various pseudoknot contents observed in different 
known structures (e.g. see triangles and circles in Fig 5C). Second, 
the average proportion of pseudoknots (projected curves and inset in 
Fig 5B) slowly increases with G+C content, since stronger (G+C rich) 
helices are more likely to compensate for the additional entropic cost 
of forming pseudoknots. Third, and perhaps more surprisingly, this 
average proportion of pseudoknots appears roughly independent of se- 
quence length except for very short sequences with low G+C content 
(inset in Fig 5B), in contradiction with a naive combinatorial argument. 
Fourth, we found that the cooperativity of secondary structure rear- 
rangements amplifies the structural consequences of pseudoknot for- 
mation; typically, a structure with 10 helices including 1 pseudoknot 
conserves not 9 but only 7 to 8 of its initial helices (while 2 to 3 new 
nested helices commitantly form) if the single pseudoknot is excluded 
from the structure prediction. Thus, neglecting pseudoknots usually in- 
duces extended structural modifications beyond the sole pseudoknots 
themselves. 

We compared these results with the folding of 1 50-nt-long sections 
of mRNAs from the genomes of Escherichia coli (50% G+C con- 
tent) and Saccharomyces cerevisiae (yeast, 40% G+C content). These 
genomes exhibit similar broad distributions of pseudoknots, despites 
small differences due to G+C content inhomogeneity and codon bias 
usage; pseudoknot proportions (mean ± std-dev.): E. coli, 15.5±6.5% 
(versus 16.5±7.9% for 50% G+C rich random sequences); yeast, 
14±6.6% (versus 15±7.3% for 40% G+C rich random sequences); 
Hence, genomic sequences appear to have maintained a large poten- 
tial for modulating the presence or absence of pseudoknots in their 3D 
structures. 

Overall, these results suggest that neglecting pseudoknots in RNA 
structure predictions is probably a stronger impediment than the small 
intrinsic inaccuracy of stacking energy parameters. In practice, combin- 
ing simple structural models (Fig 1C) and exactly clustered stochastic 
(ECS) simulations provides an effective approach to predict pseudo- 
knots in RNA structures. 
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